Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ART: Adaptive Radix Tree implementation #307

Draft
wants to merge 67 commits into
base: main
Choose a base branch
from

Conversation

vishesh
Copy link

@vishesh vishesh commented Aug 3, 2023

This PR implements Adaptive Radix Tree (ART)[1], a compact key-value search tree. It maps byte-sequences to arbitrary values while maintaining data in sorted order. ART can compress multiple nodes into single node and adaptively expand or shrink its internal nodes to avoid worse-case space consumption.

Within the broader swift-collection project, it can be considered a compact variant of SortedSet/SortedDictionary but at the cost of keys being convertible to byte-sequences.

[1] https://ieeexplore.ieee.org/abstract/document/6544812

Checklist

  • I've read the Contribution Guidelines
  • My contributions are licensed under the Swift license.
  • I've followed the coding style of the rest of the project.
  • I've added tests covering all new code paths my change adds to the project (if appropriate).
  • I've added benchmarks covering new functionality (if appropriate).
  • I've verified that my change does not break any existing tests or introduce unexplained benchmark regressions.
  • I've updated the documentation if necessary.

vishesh and others added 13 commits August 2, 2023 17:03
This PR implements Adaptive Radix Tree (ART)[1], a compact key-value search
tree. It maps byte-sequences to arbitrary values while maintaining data in
sorted order. ART can compress multiple nodes into single node and adaptively
expand or shrink its internal nodes to avoid worse-case space consumption.

Within the broader swift-collection project, it can be considered a compact
variant of SortedSet/SortedDictionary but at the cost of keys being convertible
to byte-sequences.

[1] https://ieeexplore.ieee.org/abstract/document/6544812
@vishesh vishesh force-pushed the develop-artree branch 2 times, most recently from 58b664a to 977d7dd Compare August 25, 2023 02:33
Copy link
Member

@ktoso ktoso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be honest that this is a bit outside of the area of my expertise and I wasn't able to provide a lot of valuable input -- I know Karoy has some specific tricks to pull for such collections. However, all those hints are about to become invalidated by the arrival of moveonly types. At the same time, the allocation semantics this tree has I understand are something we specifically want here -- so it is somewhat novel and not much precedent for it.

One thought is to maybe consider using the benchmark package and measure the ARC traffic before we start doing these Unmanaged dances -- it really seems a bit iffy -- also because the manual retain/releases done on the class references are on a strong class reference (buf) to a ManagedBuffer class so that would incur ARC trafi anyway -- probably best to kick off the benchmark package mentioned inline and measure ARC traffic before optimizing.

We discussed with @lorentey and the collection should probably incubate for now outside of swift-collections but we'd like to try to help out polish it up -- even if for a specific use case; which may be easier than bringing it up to the long-term compatibility standards swift-collections would require...

I think that perhaps once we put the collection through some usage that'd be a good driver for brushing it up?

Thanks for the awesome work here so far @vishesh 🙏

//
//===----------------------------------------------------------------------===//

internal struct Const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filename: a bit weird to have filename with spaces... it should "work fine" but just in case I'd suggest avoiding that.

static var testPrintAddr = false
}

public protocol ARTreeSpec {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? 🤔

@@ -0,0 +1,25 @@
//===----------------------------------------------------------------------===//
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filename: a bit weird to have filename with spaces... it should "work fine" but just in case I'd suggest avoiding that.

"ARTree/Node48.swift"
"ARTree/NodeHeader.swift"
"ARTree/NodeLeaf.swift"
"ARTree/NodeStorage.swif"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"ARTree/NodeStorage.swif"
"ARTree/NodeStorage.swift"

public typealias KeyPart = UInt8
public typealias Key = [KeyPart]

protocol ArtNode<Spec> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming consistency: ARTNode

}

var endIndex: Index {
return Index(forTree: self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right?

}

internal var endIndex: Index {
return capacity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... is it? In other places we bound by the count, so the active element number -- is this ok here?

static var capacity: Int {
@inline(__always) get { 256 }
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's the way without going unsafe I think.

///
/// `RadixTree` has the same functionality as a standard `Dictionary`, and it largely implements the
/// same APIs. However, `RadixTree` is optimized specifically for use cases where underlying keys
/// share common prefixes. The underlying data-structure is a _persistent variant of _Adaptive Radix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// share common prefixes. The underlying data-structure is a _persistent variant of _Adaptive Radix
/// share common prefixes. The underlying data-structure is a _persistent_ variant of _Adaptive Radix

static func retainChildren(_ children: Children, count: Int) {
for idx in 0..<count {
if let c = children[idx] {
_ = Unmanaged.passRetained(c.buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really comfortable with the somewhat manual retain dances we're doing here.
On the other hand, I understand the desire -- perhaps the right next step would be to write a benchmark using https://github.com/ordo-one/package-benchmark and observe the refcounts; this way you can actually measure and optimize them.

// It'll report slightly unbalanced retain/release counts, but this is okey: swiftlang/swift#64636 as at least you'll be able to view the numers changing as you tweak

@ktoso
Copy link
Member

ktoso commented Oct 19, 2023

I discussed the refernce counting a bit with @lorentey and here's some notes:

The preferred setup is to either use strong references and automatic retain/release, or to keep child pointers represented with Unmanaged and use _withUnsafeGuaranteedRef and explicit retain/release operations.

Both options have significant drawbacks, but either option is better than trying to do an ad-hoc mix of the two.

FWIW, in the prefix tree implementation of HashTreeCollections, the tree representation uses regular strong references to store the children of each node. There are no explicit retain/release operations anywhere in the code base — everything relies on regular strong reference semantics. (It tries to carefully avoid making any unnecessary copies, to avoid retain/release traffic. For example, child references aren’t loaded into local variables; they are accessed directly via storage pointers.)

(In the CHAMP data structure, child references and leaf items are mixed into the same storage buffer, and that heavily obscures the actual layout.)

That module does also define a separate internal type that wraps an Unmanaged node reference. However, that has very limited functionality — it is primarily designed for use within the Index representation, and it requires extraordinary care to avoid accidentally dereferencing a stale value.

So it really feels like we should try to use a class to do the refcounting and use the package-benchmark to measure and optimize 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants